



1fd6c4e41e2c6a6b092eb13ee72bce95-AuthorFeedback.pdf

Neural Information Processing Systems

GVQA (from VQA-CP) builds on stacked attention networks (SAN). However, SAN and, by extension, GVQA architectures do not evaluate for, and generalize poorly on, combinations of unseen object attributes (CLEVR-CoGenT) and linguistic structural patterns (CLOSURE). The language parser is not trained; it constructs object graphs (Gs) from text (s) using a rule-based entity recognizer [L126]. (W3) CLOSURE clarity: minor clarifications (5a) corrected in the camera-ready version.


Common Clarifications: (CC1) Evaluation with other datasets (VQA-CP, GQA) @R1, R2, R4: The main focus and


We thank all the reviewers for their insightful questions, comments, and commendations (novelty, clarity, performance). Task 1 motivation, complexity @R3, R4: The motivation behind Task 1 is efficacy, not complexity [L46-49]. GVQA (from VQA-CP) builds on stacked attention networks (SAN). Thus, they are orthogonal to MGN in both problem setting and architecture. We show results using MAC (from the GQA authors, L291) on both CLEVR and GQA.